{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# SageMaker and AWS KMS–Managed Keys\n",
    "_**End-to-end encryption using SageMaker and KMS-Managed keys**_\n",
    "\n",
    "---\n",
    "\n",
    "## Contents\n",
    "\n",
    "1. [Background](#Background)\n",
    "1. [Setup](#Setup)\n",
    "1. [Optionally, upload encrypted data files for training](#Optionally,-upload-encrypted-data-files-for-training)\n",
    "1. [Training the XGBoost model](#Training-the-XGBoost-model)\n",
    "1. [Set up hosting for the model](#Set-up-hosting-for-the-model)\n",
    "1. [Validate the model for use](#Validate-the-model-for-use)\n",
    "1. [Run batch prediction using batch transform](#Run-batch-prediction-using-batch-transform)\n",
    "\n",
    "---\n",
    "## Background\n",
    "\n",
    "AWS Key Management Service ([AWS KMS](http://docs.aws.amazon.com/AmazonS3/latest/dev/UsingKMSEncryption.html)) enables \n",
    "Server-side encryption to protect your data at rest. Amazon SageMaker training works with KMS encrypted data if the IAM role used for S3 access has permissions to encrypt and decrypt data with the KMS key. Further, a KMS key can also be used to encrypt the model artifacts at rest using Amazon S3 server-side encryption. Additionally, a KMS key can also be used to encrypt the storage volume attached to training, endpoint, and transform instances. In this notebook, we demonstrate SageMaker encryption capabilities using KMS-managed keys. \n",
    "\n",
    "---\n",
    "\n",
    "## Setup\n",
    "\n",
    "### Prerequisites\n",
    "\n",
    "In order to successfully run this notebook, you must first:\n",
    "\n",
    "1. Have an existing KMS key from AWS IAM console or create one ([learn more](http://docs.aws.amazon.com/kms/latest/developerguide/create-keys.html)).\n",
    "2. Allow the IAM role used for SageMaker to encrypt and decrypt data with this key from within applications and when using AWS services integrated with KMS ([learn more](http://docs.aws.amazon.com/console/kms/key-users)).\n",
    "3. Allow the IAM role for this notebook to create grants with this key ([learn more](https://docs.aws.amazon.com/sagemaker/latest/dg/api-permissions-reference.html)).\n",
    "\n",
    "We use the `key-id` from the KMS key ARN `arn:aws:kms:region:acct-id:key/key-id`.\n",
    "\n",
    "### General Setup\n",
    "Let's start by specifying:\n",
    "* AWS region.\n",
    "* The IAM role arn used to give learning and hosting access to your data. See the documentation for how to specify these.\n",
    "* The KMS key arn that you want to use for encryption.\n",
    "* The S3 bucket that you want to use for training and model data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "isConfigCell": true
   },
   "outputs": [],
   "source": [
    "%%time\n",
    "\n",
    "import os\n",
    "import io\n",
    "import boto3\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "import re\n",
    "from sagemaker import get_execution_role\n",
    "\n",
    "region = boto3.Session().region_name\n",
    "\n",
    "role = get_execution_role()\n",
    "\n",
    "kms_key_arn = '<your-kms-key-arn>'\n",
    "\n",
    "bucket='<s3-bucket>' # put your s3 bucket name here, and create s3 bucket\n",
    "prefix = 'sagemaker/DEMO-kms'\n",
    "# customize to your bucket where you have stored the data\n",
    "bucket_path = 's3://{}'.format(bucket)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Optionally, upload encrypted data files for training\n",
    "\n",
    "To demonstrate SageMaker training with KMS encrypted data, we first upload a toy dataset that has Server Side Encryption with customer provided key.\n",
    "\n",
    "### Data ingestion\n",
    "\n",
    "We, first, read the dataset from an existing repository into memory. This processing could be done *in situ* by Amazon Athena, Apache Spark in Amazon EMR, Amazon Redshift, etc., assuming the dataset is present in the appropriate location. Then, the next step would be to transfer the data to S3 for use in training. For small datasets, such as the one used below, reading into memory isn't onerous, though it would be for larger datasets."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.datasets import load_boston\n",
    "boston = load_boston()\n",
    "X = boston['data']\n",
    "y = boston['target']\n",
    "feature_names = boston['feature_names']\n",
    "data = pd.DataFrame(X, columns=feature_names)\n",
    "target = pd.DataFrame(y, columns={'MEDV'})\n",
    "data['MEDV'] = y\n",
    "local_file_name = 'boston.csv'\n",
    "data.to_csv(local_file_name, header=False, index=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Data preprocessing\n",
    "\n",
    "Now that we have the dataset, we need to split it into *train*, *validation*, and *test* datasets which we can use to evaluate the accuracy of the machine learning algorithm. We'll also create a test dataset file with the labels removed so it can be fed into a batch transform job. We randomly split the dataset into 60% training, 20% validation and 20% test. Note that SageMaker Xgboost, expects the label column to be the first one in the datasets. So, we'll move the median value column (`MEDV`) from the last to the first position within the `write_file` method below. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.model_selection import train_test_split\n",
    "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=1)\n",
    "X_test, X_val, y_test, y_val = train_test_split(X_test, y_test, test_size=0.5, random_state=1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def write_file(X, y, fname, include_labels=True):\n",
    "    feature_names = boston['feature_names']\n",
    "    data = pd.DataFrame(X, columns=feature_names)\n",
    "    if include_labels:\n",
    "        data.insert(0, 'MEDV', y)\n",
    "    data.to_csv(fname, header=False, index=False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "train_file = 'train.csv'\n",
    "validation_file = 'val.csv'\n",
    "test_file = 'test.csv'\n",
    "test_no_labels_file = 'test_no_labels.csv'\n",
    "write_file(X_train, y_train, train_file)\n",
    "write_file(X_val, y_val, validation_file)\n",
    "write_file(X_test, y_test, test_file)\n",
    "write_file(X_test, y_test, test_no_labels_file, False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Data upload to S3 with Server Side Encryption"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "s3 = boto3.client('s3')\n",
    "\n",
    "data_train = open(train_file, 'rb')\n",
    "key_train = '{}/train/{}'.format(prefix,train_file)\n",
    "kms_key_id = kms_key_arn.split(':key/')[1]\n",
    "\n",
    "print(\"Put object...\")\n",
    "s3.put_object(Bucket=bucket,\n",
    "              Key=key_train,\n",
    "              Body=data_train,\n",
    "              ServerSideEncryption='aws:kms',\n",
    "              SSEKMSKeyId=kms_key_id)\n",
    "print(\"Done uploading the training dataset\")\n",
    "\n",
    "data_validation = open(validation_file, 'rb')\n",
    "key_validation = '{}/validation/{}'.format(prefix,validation_file)\n",
    "\n",
    "print(\"Put object...\")\n",
    "s3.put_object(Bucket=bucket,\n",
    "              Key=key_validation,\n",
    "              Body=data_validation,\n",
    "              ServerSideEncryption='aws:kms',\n",
    "              SSEKMSKeyId=kms_key_id)\n",
    "\n",
    "print(\"Done uploading the validation dataset\")\n",
    "\n",
    "data_test = open(test_no_labels_file, 'rb')\n",
    "key_test = '{}/test/{}'.format(prefix,test_no_labels_file)\n",
    "\n",
    "print(\"Put object...\")\n",
    "s3.put_object(Bucket=bucket,\n",
    "              Key=key_test,\n",
    "              Body=data_test,\n",
    "              ServerSideEncryption='aws:kms',\n",
    "              SSEKMSKeyId=kms_key_id)\n",
    "\n",
    "print(\"Done uploading the test dataset\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Training the SageMaker XGBoost model\n",
    "\n",
    "Now that we have our data in S3, we can begin training. We'll use Amazon SageMaker XGboost algorithm as an example to demonstrate model training. Note that nothing needs to be changed in the way you'd call the training algorithm. The only requirement for training to succeed is that the IAM role (`role`) used for S3 access has permissions to encrypt and decrypt data with the KMS key (`kms_key_arn`). You can set these permissions using the instructions [here](http://docs.aws.amazon.com/kms/latest/developerguide/key-policies.html#key-policy-default-allow-users). If the permissions aren't set, you'll get the `Data download failed` error. Specify a `VolumeKmsKeyId` in the training job parameters to have the volume attached to the ML compute instance encrypted using key provided."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sagemaker.amazon.amazon_estimator import get_image_uri\n",
    "container = get_image_uri(boto3.Session().region_name, 'xgboost')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "from time import gmtime, strftime\n",
    "import time\n",
    "\n",
    "job_name = 'DEMO-xgboost-single-regression' + strftime(\"%Y-%m-%d-%H-%M-%S\", gmtime())\n",
    "print(\"Training job\", job_name)\n",
    "\n",
    "create_training_params = \\\n",
    "{\n",
    "    \"AlgorithmSpecification\": {\n",
    "        \"TrainingImage\": container,\n",
    "        \"TrainingInputMode\": \"File\"\n",
    "    },\n",
    "    \"RoleArn\": role,\n",
    "    \"OutputDataConfig\": {\n",
    "        \"S3OutputPath\": bucket_path + \"/\"+ prefix + \"/output\"\n",
    "    },\n",
    "    \"ResourceConfig\": {\n",
    "        \"InstanceCount\": 1,\n",
    "        \"InstanceType\": \"ml.m4.4xlarge\",\n",
    "        \"VolumeSizeInGB\": 5,\n",
    "        \"VolumeKmsKeyId\": kms_key_arn\n",
    "    },\n",
    "    \"TrainingJobName\": job_name,\n",
    "    \"HyperParameters\": {\n",
    "        \"max_depth\":\"5\",\n",
    "        \"eta\":\"0.2\",\n",
    "        \"gamma\":\"4\",\n",
    "        \"min_child_weight\":\"6\",\n",
    "        \"subsample\":\"0.7\",\n",
    "        \"silent\":\"0\",\n",
    "        \"objective\":\"reg:linear\",\n",
    "        \"num_round\":\"5\"\n",
    "    },\n",
    "    \"StoppingCondition\": {\n",
    "        \"MaxRuntimeInSeconds\": 86400\n",
    "    },\n",
    "    \"InputDataConfig\": [\n",
    "        {\n",
    "            \"ChannelName\": \"train\",\n",
    "            \"DataSource\": {\n",
    "                \"S3DataSource\": {\n",
    "                    \"S3DataType\": \"S3Prefix\",\n",
    "                    \"S3Uri\": bucket_path + \"/\"+ prefix + '/train',\n",
    "                    \"S3DataDistributionType\": \"FullyReplicated\"\n",
    "                }\n",
    "            },\n",
    "            \"ContentType\": \"csv\",\n",
    "            \"CompressionType\": \"None\"\n",
    "        },\n",
    "        {\n",
    "            \"ChannelName\": \"validation\",\n",
    "            \"DataSource\": {\n",
    "                \"S3DataSource\": {\n",
    "                    \"S3DataType\": \"S3Prefix\",\n",
    "                    \"S3Uri\": bucket_path + \"/\"+ prefix + '/validation',\n",
    "                    \"S3DataDistributionType\": \"FullyReplicated\"\n",
    "                }\n",
    "            },\n",
    "            \"ContentType\": \"csv\",\n",
    "            \"CompressionType\": \"None\"\n",
    "        }\n",
    "    ]\n",
    "}\n",
    "\n",
    "client = boto3.client('sagemaker')\n",
    "client.create_training_job(**create_training_params)\n",
    "\n",
    "try:\n",
    "    # wait for the job to finish and report the ending status\n",
    "    client.get_waiter('training_job_completed_or_stopped').wait(TrainingJobName=job_name)\n",
    "    training_info = client.describe_training_job(TrainingJobName=job_name)\n",
    "    status = training_info['TrainingJobStatus']\n",
    "    print(\"Training job ended with status: \" + status)\n",
    "except:\n",
    "    print('Training failed to start')\n",
    "     # if exception is raised, that means it has failed\n",
    "    message = client.describe_training_job(TrainingJobName=job_name)['FailureReason']\n",
    "    print('Training failed with the following error: {}'.format(message))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Set up hosting for the model\n",
    "In order to set up hosting, we have to import the model from training to hosting. \n",
    "\n",
    "### Import model into hosting\n",
    "\n",
    "Register the model with hosting. This allows the flexibility of importing models trained elsewhere."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "import boto3\n",
    "from time import gmtime, strftime\n",
    "\n",
    "model_name=job_name + '-model'\n",
    "print(model_name)\n",
    "\n",
    "info = client.describe_training_job(TrainingJobName=job_name)\n",
    "model_data = info['ModelArtifacts']['S3ModelArtifacts']\n",
    "print(model_data)\n",
    "\n",
    "primary_container = {\n",
    "    'Image': container,\n",
    "    'ModelDataUrl': model_data\n",
    "}\n",
    "\n",
    "create_model_response = client.create_model(\n",
    "    ModelName = model_name,\n",
    "    ExecutionRoleArn = role,\n",
    "    PrimaryContainer = primary_container)\n",
    "\n",
    "print(create_model_response['ModelArn'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Create endpoint configuration\n",
    "\n",
    "SageMaker supports configuring REST endpoints in hosting with multiple models, e.g. for A/B testing purposes. In order to support this, customers create an endpoint configuration, that describes the distribution of traffic across the models, whether split, shadowed, or sampled in some way. In addition, the endpoint configuration describes the instance type required for model deployment and the key used to encrypt the volume attached to the endpoint instance."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from time import gmtime, strftime\n",
    "\n",
    "endpoint_config_name = 'DEMO-XGBoostEndpointConfig-' + strftime(\"%Y-%m-%d-%H-%M-%S\", gmtime())\n",
    "print(endpoint_config_name)\n",
    "create_endpoint_config_response = client.create_endpoint_config(\n",
    "    EndpointConfigName = endpoint_config_name,\n",
    "    KmsKeyId = kms_key_arn,\n",
    "    ProductionVariants=[{\n",
    "        'InstanceType':'ml.m4.xlarge',\n",
    "        'InitialVariantWeight':1,\n",
    "        'InitialInstanceCount':1,\n",
    "        'ModelName':model_name,\n",
    "        'VariantName':'AllTraffic'}])\n",
    "\n",
    "print(\"Endpoint Config Arn: \" + create_endpoint_config_response['EndpointConfigArn'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Create endpoint\n",
    "Create the endpoint that serves up the model, through specifying the name and configuration defined above. The end result is an endpoint that can be validated and incorporated into production applications. This takes 9-11 minutes to complete."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "import time\n",
    "\n",
    "endpoint_name = 'DEMO-XGBoostEndpoint-' + strftime(\"%Y-%m-%d-%H-%M-%S\", gmtime())\n",
    "print(endpoint_name)\n",
    "create_endpoint_response = client.create_endpoint(\n",
    "    EndpointName=endpoint_name,\n",
    "    EndpointConfigName=endpoint_config_name)\n",
    "print(create_endpoint_response['EndpointArn'])\n",
    "\n",
    "\n",
    "print('EndpointArn = {}'.format(create_endpoint_response['EndpointArn']))\n",
    "\n",
    "# get the status of the endpoint\n",
    "response = client.describe_endpoint(EndpointName=endpoint_name)\n",
    "status = response['EndpointStatus']\n",
    "print('EndpointStatus = {}'.format(status))\n",
    "\n",
    "\n",
    "# wait until the status has changed\n",
    "client.get_waiter('endpoint_in_service').wait(EndpointName=endpoint_name)\n",
    "\n",
    "\n",
    "# print the status of the endpoint\n",
    "endpoint_response = client.describe_endpoint(EndpointName=endpoint_name)\n",
    "status = endpoint_response['EndpointStatus']\n",
    "print('Endpoint creation ended with EndpointStatus = {}'.format(status))\n",
    "\n",
    "if status != 'InService':\n",
    "    raise Exception('Endpoint creation failed.')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Validate the model for use\n",
    "You can now validate the model for use. Obtain the endpoint from the client library using the result from previous operations, and run a single prediction on the trained model using that endpoint.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "runtime_client = boto3.client('runtime.sagemaker')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import sys\n",
    "import math\n",
    "def do_predict(data, endpoint_name, content_type):\n",
    "    response = runtime_client.invoke_endpoint(EndpointName=endpoint_name, \n",
    "                                   ContentType=content_type, \n",
    "                                   Body=data)\n",
    "    result = response['Body'].read()\n",
    "    result = result.decode(\"utf-8\")\n",
    "    return result\n",
    "\n",
    "# pull the first item from the test dataset\n",
    "with open('test.csv') as f:\n",
    "    first_line = f.readline()\n",
    "    features = first_line.split(',')[1:]\n",
    "    feature_str = ','.join(features)\n",
    "\n",
    "prediction = do_predict(feature_str, endpoint_name, 'text/csv')\n",
    "print('Prediction: ' + prediction)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### (Optional) Delete the Endpoint\n",
    "\n",
    "If you're ready to be done with this notebook, please run the delete_endpoint line in the cell below.  This will remove the hosted endpoint you created and avoid any charges from a stray instance being left on."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "client.delete_endpoint(EndpointName=endpoint_name)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Run batch prediction using batch transform\n",
    "Create a transform job to do batch prediction using the trained model. Similar to the training section above, the execution role assumed by this notebook must have permissions to encrypt and decrypt data with the KMS key (`kms_key_arn`) used for S3 server-side encryption. Similar to training, specify a `VolumeKmsKeyId` so that the volume attached to the transform instance is encrypted using the key provided."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "transform_job_name = 'DEMO-xgboost-batch-prediction' + strftime(\"%Y-%m-%d-%H-%M-%S\", gmtime())\n",
    "print(\"Transform job\", transform_job_name)\n",
    "\n",
    "transform_params = \\\n",
    "{\n",
    "    \"TransformJobName\": transform_job_name,\n",
    "    \"ModelName\": model_name,\n",
    "    \"TransformInput\": {\n",
    "        \"ContentType\": \"text/csv\",\n",
    "        \"DataSource\": {\n",
    "            \"S3DataSource\": {\n",
    "                \"S3DataType\": \"S3Prefix\",\n",
    "                \"S3Uri\": bucket_path + \"/\"+ prefix + '/test'\n",
    "            }\n",
    "        },\n",
    "        \"SplitType\": \"Line\"\n",
    "    },\n",
    "    \"TransformOutput\": {\n",
    "        \"AssembleWith\": \"Line\",\n",
    "        \"S3OutputPath\": bucket_path + \"/\"+ prefix + '/predict'\n",
    "    },\n",
    "    \"TransformResources\": {\n",
    "        \"InstanceCount\": 1,\n",
    "        \"InstanceType\": \"ml.c4.xlarge\",\n",
    "        \"VolumeKmsKeyId\": kms_key_arn\n",
    "    }\n",
    "}\n",
    "\n",
    "client.create_transform_job(**transform_params)\n",
    "\n",
    "while True:\n",
    "    response = client.describe_transform_job(TransformJobName=transform_job_name)\n",
    "    status = response['TransformJobStatus']\n",
    "    if status == 'InProgress':\n",
    "        time.sleep(15)\n",
    "    elif status == 'Completed':\n",
    "        print(\"Transform job completed!\")\n",
    "        break\n",
    "    else:\n",
    "        print(\"Unexpected transform job status: \" + status)\n",
    "        break"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Evaluate the batch predictions\n",
    "\n",
    "The following helps us calculate the Median Absolute Percent Error (MdAPE) on the batch prediction output in S3. Note that the intent of this example is not to produce the most accurate regressor but to demonstrate how to handle KMS encrypted data with SageMaker."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"Downloading prediction object...\")\n",
    "s3.download_file(Bucket=bucket,\n",
    "                 Key=prefix + '/predict/' + test_no_labels_file + '.out',\n",
    "                 Filename='./predictions.csv')\n",
    "\n",
    "preds = np.loadtxt('predictions.csv')\n",
    "print('\\nMedian Absolute Percent Error (MdAPE) = ', np.median(np.abs(y_test - preds) / y_test))"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "conda_python3",
   "language": "python",
   "name": "conda_python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.2"
  },
  "notice": "Copyright 2017 Amazon.com, Inc. or its affiliates. All Rights Reserved.  Licensed under the Apache License, Version 2.0 (the \"License\"). You may not use this file except in compliance with the License. A copy of the License is located at http://aws.amazon.com/apache2.0/ or in the \"license\" file accompanying this file. This file is distributed on an \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License."
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
